5/17/2017

Overview

  1. What is Size-Biased Data?
  2. Scientific Background for Mitochondria
  3. Goals for this project
  4. How the sampling process caused size-biased data
  5. Investigate Possible Estimators with simulation study
  6. Use the best ones on real data
  7. Conclusion
  8. Discussion

Story about Size-Biased Data



Scientific Background for Mitochondria

Goals for this project

  1. Whether the properties (area, perimeter, circularity and aspect ratio) of mitochondria differ by location (proximal, middle and distal end).
  2. Suggestions on sampling method for future research (more cells).

Sampling Process - 1

  • A young muscle fiber cell was magnified into 166 different images using a Transmission Electron Microscope (TEM).


Sampling Process - 2

  • For each location, divide images into two groups:
    Subsarcolemmal and Interfibrillar groups (this grouping is ignored later).
  • In each group, randomly pick one image.
  • In each image, sample 20 mitochondria.

Sampling Process - 3

  • Generate a list of random coordinates.
  • Pick the mitochondria whose area in the photo includes one or more generated coordinates.

Raw Data

  • Area \(({\mu m}^{2})\):
    The area occupied by a mitochondrion in an image.
  • Perimeter \((\mu m)\):
    The length of the boundary of a mitochondrion in an image.
  • Circularity:
    Circularity is equal to \(\frac{4 \pi Area}{Perimeter^2}\).

    (Measuring the resemblance of a mitochondrion to a circle. The range of circularity is between 0 and 1. 1 means a perfect circle.)

  • Aspect Ratio:
    Aspect Ratio is equal to \(\frac{Length}{Width}\).

    (If \(AR \leq 2\), it is considered short; if \(2 < AR \leq 4\), intermediate; if \(AR > 4\), long.)

Data Exploration: Area

Data Exploration: Perimeter

Data Exploration: Circularity

Data Exploration: Aspect Ratio

Data Exploration: Scatter Plots



Best Estimators

  • Circularity:
    Arithmetic Mean
  • Aspect Ratio:
    Arithmetic Mean

Problems from the Sampled Data

  1. It is NOT a random sample; it is size-biased!
  2. Larger mitochondria are more likely to be picked into our sample.
  3. If we use the sample mean to estimate the population mean, it will overestimate it!

New Goals for this project

  1. What is the appropriate estimator for the size-biased data?
    A: Simulation Study for finding the best estimator.
  2. Whether the properties of mitochondria differ by location.
    A: Permutation Test and Bootstrapping Confidence Interval
  3. Suggestions on sampling scheme for future research.
    A: Based on the Simulation Study.

Weighted Distribution


  1. Cox (1962) proposed an idea of Weighted Distribution, \[{f}^{\ast}(x)=\frac{w(x)f(x)}{{E}_{f}(w(x))}.\]
  2. Cox (1962) also proposed the Harmonic Mean (\(\frac{n}{\sum_{i=1}^{n}\frac{1}{{x}_{i}}}\)) as an estimator of population mean of \(X\), and proved that it will converge to \(\mu={E}_{f}(x)\) as \(n \to \infty\) when \(w(x)=x\).
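
A short derivation of both facts for \(w(x)=x\), assuming \(X>0\) with finite mean \(\mu\) and variance \(\sigma^2\) under \(f\) (the symbol \(\sigma^2\) is introduced here for illustration only):

\[ {E}_{{f}^{\ast}}(X)=\int x\,\frac{x f(x)}{\mu}\,dx=\frac{{E}_{f}(X^2)}{\mu}=\mu+\frac{\sigma^2}{\mu}\;\geq\;\mu, \qquad {E}_{{f}^{\ast}}\left(\frac{1}{X}\right)=\int \frac{1}{x}\,\frac{x f(x)}{\mu}\,dx=\frac{1}{\mu}. \]

The first identity shows that the sample mean of size-biased data overshoots \(\mu\). The second shows that the average of \(\frac{1}{{x}_{i}}\) converges to \(\frac{1}{\mu}\), so its reciprocal, the Harmonic Mean, converges to \(\mu\).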

Simulation Study - Area

  • Assume the true distribution, \(Area\; \sim \; Exp(\theta) = f(A)\).
  • Then the observed distribution, \(Area\; \sim \; Gamma(2,\theta) = f^{*}(A)\).
  • The red dashed line is \(Gamma(2, \widehat{\theta}),\) where \(\widehat{\theta} = \frac{\bar{a}}{2} \doteq 1183\)
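
A worked step that may help here, assuming \(\theta\) is the scale (mean) parameter of the exponential and \(w(A)=A\):

\[ f(a)=\frac{1}{\theta}e^{-a/\theta},\quad {E}_{f}(A)=\theta \;\;\Rightarrow\;\; f^{*}(a)=\frac{a\,f(a)}{\theta}=\frac{a}{\theta^{2}}e^{-a/\theta}, \]

which is the \(Gamma(2,\theta)\) density (shape 2, scale \(\theta\)) with mean \(2\theta\). Maximizing the observed-data log-likelihood

\[ \ell(\theta)=\sum_{i=1}^{n}\log {a}_{i}-\frac{1}{\theta}\sum_{i=1}^{n}{a}_{i}-2n\log\theta \;\;\Rightarrow\;\; \widehat{\theta}=\frac{\sum_{i=1}^{n}{a}_{i}}{2n}=\frac{\bar{a}}{2}. \]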

Candidate Estimators - Area

  1. Arithmetic Mean (AM) \[\frac{\sum_{i=1}^{n}{a}_{i}}{n}\]
  2. Weighted Mean (WM) or Harmonic Mean

    \[ \frac{\sum_{i=1}^{n}{w}_{i}{a}_{i}}{\sum_{i=1}^{n}{w}_{i}}=\frac{n}{\sum_{i=1}^{n}\frac{1}{{a}_{i}}}\;,\;\;\text{where}\;\; {w}_{i}=\frac{1}{{p}_{i}}=\frac{n\bar{a}}{{a}_{i}} \]
  3. Maximum Likelihood Estimator (MLE)

    \[\frac{\sum_{i=1}^{n}{a}_{i}}{2n}=\frac{AM}{2}\]
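
The three candidates can be written in a few lines. This is a minimal sketch in Python with NumPy; the function names are hypothetical and are not taken from the original analysis.

```python
import numpy as np

def arithmetic_mean(a):
    """AM: plain sample mean of the observed (size-biased) areas."""
    return np.asarray(a, dtype=float).mean()

def weighted_mean(a):
    """WM: weights proportional to 1/a_i, which reduces to the harmonic mean."""
    a = np.asarray(a, dtype=float)
    w = 1.0 / a                      # the constant n * a_bar cancels in the ratio
    return np.sum(w * a) / np.sum(w)

def mle_exponential(a):
    """MLE of the true mean, assuming true Area ~ Exp and observed ~ Gamma(2, theta)."""
    return arithmetic_mean(a) / 2.0
```

For example, `weighted_mean([1, 2, 4])` equals `3 / (1 + 1/2 + 1/4)`, the harmonic mean of the three values.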

Simulation Study - Area (Overview)

  • Simulate mitochondria data in a muscle fiber cell.
  • Sample from a finite population (size \(\mathbf{N}\)) rather than an infinite population.
  • Sample both with replacement and without replacement.
  • The sample size (\(\mathbf{n}\)) is set by the \(\mathbf{Ratio}\) of \(\mathbf{n}\) to \(\mathbf{N}\).

Simulation Study - Area

  1. Assume \(Area \sim Exp(\mu)\),
    Set \(N = 2000\),
    \(Ratio\) = \((5\%, 10\%, 30\%, 50\%, 70\%, 95\%)\),
    \(Repeated\;Times = 1000\),
    \(\mu = 1000\).
  2. Generate \(N\) values from \(Exp(\mu)\) as the subpopulation of Area.
    Calculate the subpopulation mean, \(\mu_A\) (This is what we are interested in!).
  3. Draw a sample of size \(n\) from the subpopulation (\(N\)) with sampling probability proportional to Area, both with and without replacement (\(n = N \times Ratio\)).

Simulation Study - Area

  4. For each sample, calculate the candidate estimators: Arithmetic Mean (AM), Weighted Mean (WM) and Maximum Likelihood Estimator (MLE).
  5. Repeat steps 3 and 4 \(Repeated\;Times\) times for each \(Ratio\).
  6. Calculate the Mean, Standard Deviation and Root MSE for each candidate estimator.
    Draw plots of the sampling distributions for each candidate estimator.
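
The simulation steps can be sketched as follows. This is an illustration in Python with NumPy, not the code used for the study: `simulate_area` and its parameters are hypothetical names, and NumPy's unequal-probability draw with `replace=False` is only an assumption about how sampling without replacement was done.

```python
import numpy as np

def simulate_area(N=2000, ratio=0.10, reps=1000, mu=1000.0, replace=True, seed=0):
    """Size-biased sampling from one finite subpopulation of Exp(mu) areas."""
    rng = np.random.default_rng(seed)
    area = rng.exponential(scale=mu, size=N)   # finite subpopulation of Area
    mu_A = area.mean()                         # target: subpopulation mean
    prob = area / area.sum()                   # selection probability proportional to Area
    n = int(round(N * ratio))

    est = {"AM": np.empty(reps), "WM": np.empty(reps), "MLE": np.empty(reps)}
    for r in range(reps):
        a = area[rng.choice(N, size=n, replace=replace, p=prob)]
        est["AM"][r] = a.mean()
        est["WM"][r] = n / np.sum(1.0 / a)     # harmonic mean
        est["MLE"][r] = a.mean() / 2.0

    summary = {
        name: {
            "mean": v.mean(),
            "sd": v.std(ddof=1),
            "rmse": np.sqrt(np.mean((v - mu_A) ** 2)),
        }
        for name, v in est.items()
    }
    return mu_A, summary
```

Calling `simulate_area(ratio=r, replace=flag)` for each \(Ratio\) and for both values of `flag` gives the Mean, Standard Deviation and Root MSE of each candidate against \(\mu_A\).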

Results of Simulation Study - Area

Best Estimators - Area

  • Sampling "WITH" Replacement:
    Weighted Mean and MLE
  • Sampling "WITHOUT" Replacement:
    Unfortunately, not clear yet.

Simulation Study - Perimeter

  • \(Perimeter =\sqrt{4\pi}\sqrt{\frac{Area}{Circularity}}\)
  • \(Area \perp Circularity\).
  • The observed distribution is \(Circularity\;\sim\;Beta(15,5)\).
  • Assume that the true distribution is \(Circularity\;\sim\;Beta(\alpha, \beta)\).
  • The red dashed line is \(Beta(15, 5)\).


Candidate Estimators - Perimeter

  1. Arithmetic Mean (AM) \[\frac{\sum_{i=1}^{n}{p}_{i}}{n}\]
  2. Weighted Mean (WM) \[\frac{\sum_{i=1}^{n}{w}_{i}{p}_{i}}{\sum_{i=1}^{n}{w}_{i}}\;,\;\;\text{where}\;\; {w}_{i}=\frac{n\bar{a}}{{a}_{i}}\]
  3. Delta Method Estimator (DME) \[\sqrt{4\pi}\sqrt{\frac{\bar{a}/2}{\bar{c}}}\]
  4. 2nd Order Taylor's Approximation Estimator (2TAE) \[\sqrt{4 \pi}\left[ \sqrt{\frac{\bar{a}/2}{\bar{c}}} - \frac{1}{8} \left(\frac{\bar{a}}{2}\right)^{-\frac{3}{2}}(\bar{c})^{-\frac{1}{2}}\frac{{s}_{a}^2}{2}+\frac{3}{8}\left(\frac{\bar{a}}{2}\right)^{\frac{1}{2}}(\bar{c})^{-\frac{5}{2}}{s}_{c}^2\right]\]

Simulation Study - Perimeter

  1. Generate the finite subpopulation (\(\mathbf{N}\)) data from \(Circularity \sim Beta(15,5)\).
  2. Plug the generated Area and Circularity data into formula to obtain subpopulation of Perimeter.
  3. Sample from the subpopulation (\(N\)) with sampling probability proportional to Area with and without replacement.
  4. Compare the performance of the candidate estimators: Arithmetic Mean (AM), Weighted Mean (WM), Delta Method Estimator (DME) and 2nd Order Taylor's Approximation Estimator (2TAE).
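
The perimeter study can be sketched in the same way. This is a hedged illustration in Python with NumPy: the function names are hypothetical, true Area is assumed \(Exp(\mu)\) and independent of Circularity, and the 2TAE line is the second-order Taylor expansion of \(\sqrt{a/c}\) with \(\bar{a}/2\) and \({s}_{a}^2/2\) standing in for the true mean and variance of Area.

```python
import numpy as np

def perimeter_estimators(a, c, p):
    """AM, WM, DME and 2TAE from sampled area a, circularity c, perimeter p."""
    a, c, p = (np.asarray(x, dtype=float) for x in (a, c, p))
    theta = a.mean() / 2.0             # a_bar / 2: estimated true mean area
    var_a = a.var(ddof=1) / 2.0        # s_a^2 / 2: estimated true area variance
    c_bar, var_c = c.mean(), c.var(ddof=1)
    w = 1.0 / a                        # inverse-area weights
    root = np.sqrt(theta / c_bar)
    # Second-order terms: (1/2) g_aa Var(a) and (1/2) g_cc Var(c), g = sqrt(a/c)
    second = (-0.125 * theta ** -1.5 * c_bar ** -0.5 * var_a
              + 0.375 * theta ** 0.5 * c_bar ** -2.5 * var_c)
    k = np.sqrt(4.0 * np.pi)
    return {
        "AM": p.mean(),
        "WM": np.sum(w * p) / np.sum(w),
        "DME": k * root,
        "2TAE": k * (root + second),
    }

def simulate_perimeter(N=2000, ratio=0.10, reps=1000, mu=1000.0, replace=True, seed=0):
    """Return the subpopulation mean perimeter and the mean of each estimator."""
    rng = np.random.default_rng(seed)
    area = rng.exponential(scale=mu, size=N)
    circ = rng.beta(15, 5, size=N)
    perim = np.sqrt(4.0 * np.pi * area / circ)
    prob = area / area.sum()           # selection probability proportional to Area
    n = int(round(N * ratio))
    runs = []
    for _ in range(reps):
        idx = rng.choice(N, size=n, replace=replace, p=prob)
        runs.append(perimeter_estimators(area[idx], circ[idx], perim[idx]))
    means = {name: float(np.mean([r[name] for r in runs])) for name in runs[0]}
    return perim.mean(), means
```

Comparing each entry of `means` with the returned subpopulation mean shows the bias of each candidate for a given \(Ratio\).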

Simulated Data - Perimeter

Results of Simulation Study - Perimeter

Best Estimators - Perimeter

  • Sampling "WITH" Replacement:
    Weighted Mean and 2TAE
  • Sampling "WITHOUT" Replacement:
    Unfortunately, not clear yet.

Hypothesis Test

  • Overall Hypothesis Test:

\[ \begin{align*} {H}_{0} &: {\mu}_{{i}_{P}} = {\mu}_{{i}_{M}} = {\mu}_{{i}_{D}}\\ {H}_{A} &: \text{At least one} \: {\mu}_{{i}_{j}} \neq {\mu}_{{i}_{k}} \end{align*} \]

  • Pairwise Comparison Test:

\[ \begin{align*} {H}_{0} &: {\mu}_{{i}_{j}} = {\mu}_{{i}_{k}} \\ {H}_{A} &: {\mu}_{{i}_{j}} \neq {\mu}_{{i}_{k}} \\ \end{align*} \] \[ \begin{align*} i &= \left \{ \text{Area, Perimeter, Circularity, Aspect Ratio} \right \} \\ j,k & = \left \{ \text{P, M, D} \right \} \end{align*} \]

Hypothesis Test : Permutation Test

  • Reasons:
    • Area and Perimeter are size-biased.
    • The Circularity and Aspect Ratio data violate the normality assumption of ANOVA and the t-test.

  • Overall Test (Permutation Test of ANOVA):
    • Test statistic: \(\sum_{i \in \left\{P,M,D\right\}}{(\widehat{{\mu}_{i}}-\widehat{{\mu}})}^{2}\)
    • significance level = \(5\%\)

  • Pairwise Comparison Test (Permutation Test of T-test):
    • Test statistic: \(\widehat{\mu}_{i}-\widehat{\mu}_{j}\), where \(i, j \in \left \{P,M,D\right\}\) and \(i \neq j\)
    • Bonferroni’s correction: significance level = \(\frac{5\%}{3} = 0.0167\).
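
The overall permutation test can be sketched as below. This is a minimal illustration in Python with NumPy, not the original analysis code: the function names, the number of permutations and the `estimator` argument are assumptions. Passing a harmonic-mean function as `estimator` would be one way to use the Weighted Mean for Area.

```python
import numpy as np

def between_group_stat(values, labels, estimator=np.mean):
    """Sum over locations of (group estimate - overall estimate)^2."""
    values, labels = np.asarray(values, dtype=float), np.asarray(labels)
    overall = estimator(values)
    return sum(
        (estimator(values[labels == g]) - overall) ** 2
        for g in np.unique(labels)
    )

def permutation_test(values, labels, stat=between_group_stat, n_perm=2000, seed=0):
    """P-value from shuffling the location labels and recomputing the statistic."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = stat(values, labels)
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(labels)
        if stat(values, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive
    return (hits + 1) / (n_perm + 1)
```

A pairwise comparison can reuse `permutation_test` on the two locations of interest with a statistic such as the absolute difference of the two group estimates, judged against the Bonferroni-corrected level.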

Results for the Hypothesis Test

Bootstrapping CI for Means

Bootstrapping CI for the differences

Conclusions

  1. The middle part of the muscle fiber cell has larger Area, Perimeter and Circularity.
    • Area: \(\underline{M > P} > D\)
    • Perimeter: \(\underline{M > P} > D\)
    • Circularity: \(\underline{M > P}>D\) & \(M > \underline{P > D}\)
    • Aspect Ratio: \(\underline{M > P > D}\)

  2. The appropriate estimator for the size-biased data is the nonparametric Weighted Mean.

  3. We suggest using Sampling With Replacement (SWR) rather than Sampling Without Replacement (SWOR) in future sampling schemes.
    The Weighted Mean does not perform well under SWOR when the ratio of \(n\) to \(N\) is larger than \(10\%\).

Future Work

  • Find the best estimator for SWOR.
  • Robustness of the distribution assumptions can be an interesting topic.
    The nonparametric Weighted Mean gave notably different results from the parametric estimators (MLE for Area and 2TAE for Perimeter), possibly because of improper distribution assumptions on Area and Circularity.
  • Include the effect of the Subsarcolemmal and Interfibrillar groups, and possibly their interaction with location.

References

  • Bratic, Ana and Larsson, Nils-Göran. “The Role of Mitochondria in Aging.” Journal of Clinical Investigation 123, no. 3 (2013): 951-57.
  • Cox, D. R. Renewal Theory. London: Methuen, 1962.
  • Patil, G. P. and Ord, J. K. “On Size-Biased Sampling and Related Form-Invariant Weighted Distributions.” Sankhya, Series B 38: 48-61.
  • Jones, M. C. “Kernel Density Estimation for Length Biased Data.” Biometrika 78, no. 3 (1991): 511-19.

Photos

The end

Questions?